To efficiently resolve Domino Server performance issues and most effectively raise such issues to IBM Lotus Support, it's important to gather data in the right quantity and of the right quality. From basic problem description to detailed system statistics, taking the time to get it right will save time in the long run and establish a practice of system monitoring that will both detect and resolve issues quickly going forward.
There are two corporate groups that typically receive complaints and respond to instances of performance problems on Domino servers. They are those who work at help desks and those who administer Domino servers. While there is certainly overlap in the data these groups receive, there is enough difference to break them out individually in our discussion.
End User Help Desk
Domino server performance issues manifest themselves in a few easily identified ways on the Notes client:
a. Slow response or an hourglass - The duration of operations is most noticeable when it exceeds normal, where "normal" is often a soft, subjective perception, though it can represent a breach of a hard service level agreement. In several cases, end users can remain productive while a particular Domino server experiences delays, but in the few areas where the Notes client remains single threaded, the hourglass will replace the mouse cursor icon.
b. Inability to connect - The message received in this case is "Remote system no longer responding". The server to which the user was attempting to connect can either be undergoing an outage due to a crash or it can be unavailable due to overload or excessive delay.
Of course, people will communicate their experience, and that is extremely valuable initial information. But it is only initial data, and it is almost never enough to troubleshoot or truly determine the nature of the problem as it is happening.
To home in on some basic and helpful details, it is important to gather the answers to the following questions while the experience is fresh in mind:
- "What were you doing at the time of the problem?" - Any description of mouse gestures, use of Domino objects like databases, views, folders or documents can be helpful. It is not that the actions may have triggered the problem, but it may be vital to know what actions are affected by the problem.
- "Can you do anything on the server?" - It may take a savvy user to know which particular server is being accessed during a problem, but walking a user through pressing CTRL + O, specifying the server name, and opening the names.nsf database can give quick insight into the breadth of an issue.
- "Is it happening to everyone?" - While this will be answered in short order if complaints roll in, often a user has asked his/her colleagues if they are experiencing the same problem. Both answers "yes" and "no" are important to know. It is also true that his/her colleagues may or may not share the same server, and that needs to be known when asking the question.
- "Does it recover?" and "How often does this happen?" - The answers to these are key to troubleshooting performance issues. While sporadic problems are generally more difficult to diagnose and resolve, gathering the times and lengths of outages as early as possible can greatly speed up the diagnosis effort. If a problem is persistent, then instructing a user to set CLIENT_CLOCK=1 in the Notes client's notes.ini file and capturing the output can be extremely revealing.
Knowing the answers to these and similar questions best leverages the time spent on the phone with someone reporting a problem to the help desk.
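The times and lengths of outages gathered from callers can be tabulated quickly to show whether reports cluster on one server or at one time of day. The following is a minimal sketch, assuming the help desk records each report with a server name, a start time, a duration in minutes, and whether the client recovered. The Incident record, the summarize_by_hour function, and the server name "mail01" are hypothetical names chosen for illustration; they are not part of Notes or Domino.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Incident:
    """One user-reported slowdown or outage (hypothetical help desk record)."""
    server: str        # server the user was working on, as best known
    started: datetime  # when the user first noticed the problem
    minutes: float     # how long the problem lasted, per the user
    recovered: bool    # answer to "Does it recover?"


def summarize_by_hour(incidents):
    """Return {(server, hour_of_day): (report_count, total_minutes)}."""
    totals = defaultdict(lambda: [0, 0.0])
    for inc in incidents:
        key = (inc.server, inc.started.hour)
        totals[key][0] += 1
        totals[key][1] += inc.minutes
    return {key: (count, minutes) for key, (count, minutes) in totals.items()}


if __name__ == "__main__":
    # Illustrative data only; real entries would come from help desk tickets.
    reports = [
        Incident("mail01", datetime(2024, 1, 8, 9, 5), 4, True),
        Incident("mail01", datetime(2024, 1, 9, 9, 40), 6, True),
        Incident("mail01", datetime(2024, 1, 9, 14, 0), 2, False),
    ]
    for (server, hour), (count, minutes) in sorted(summarize_by_hour(reports).items()):
        print(f"{server} {hour:02d}:00  reports={count}  total_minutes={minutes}")
```

Grouping by server and hour is only one possible cut of the data; the same records could equally be grouped by the action the user was performing.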
Domino Administrators
No one is closer to Domino Server performance issues than the administrator, who is able to gather data about problems more quickly and accurately than anyone else in the organization. The answers to vital questions that an administrator can provide come from analysis of data through tooling or empirical observation. These questions can seem redundant with those asked by the help desk, but the answers are based upon hard (vs. soft), cumulative data, that is, different data than help desk personnel have access to. Some of the most important questions to be answered are:
- "What is the nature of the load on this server?" or "What is this server's purpose?" - This sounds obvious, and it may be readily known, but in an enterprise, the type or mix of types of load on a server may be the most important early piece of data to gather in problem determination.
- "Are there third party packages in the mix?" - Domino has a very rich and exposed set of APIs, and the behavior and good-citizenship of third party packages using those interfaces can vary tremendously. Understanding the nature and timing of the processing performed by such packages, and learning whether any changes have taken place in their deployment, can be instructive in the overall analysis.
- "Is the console responsive?" - This is a simple question with a "yes" or "no" answer that is instrumental in classifying an outage. A "no" to this question may be a source of frustration, but in reality it is just an indicator of a different course of action in data gathering and analysis.
- "Did the server configuration change (hardware/software)?" - IBM support engineers often hear "no" when they ask this question, but an honest, well-informed answer to this can be the first key to resolution. Examples are:
- Template changes - mail, pubnames, application templates
- Version of Domino - What was the old version and what is the new one? What process was used in the migration? What new features were deployed?
- New home grown add-ins
- New or upgraded third party packages - What versions? Even changes in software unrelated to Domino can be important
- Hardware changes - New or changed disks, new servers with or without change in platform
- Conversion/Migration - What data has moved onto the server? What was the process to move it?
- What, if any, Operating System parameters changed?
- "Has there been a recent, known increase in load?" - Whether a group of users have recently failed over, a long-running agent or other process has been recently deployed or any of many other possible sources of resource-consuming load has been introduced, this is very important to know. Even if a component showed no ill effect in a test environment, the combination of its processing with production load can cause states that threaten availability.
- "Does the issue happen at a certain time of day?" - So much processing is machine-scheduled or just cyclical due to standard business practice that annotating the times and durations of problems can provide quick, telltale associations.
- "What are the server's 'vital stats' versus 'steady state'?" - CPU, disk usage, memory thresholds and network throughput are all necessary to obtain during a performance issue. In the description of a problem, it is important to record how these statistics vary between the steady, fully loaded state and the problem state. In fact, tooling that responds to changes in such data can be extremely helpful in heading off issues before they become outages.
- "Across the enterprise, have there been new policies or practices introduced?" - Performance-threatening load can be introduced as an unforeseen side effect of new corporate standards, time zone-ignorant processing or reworked data flow. While no one would intend for corporate changes to cause such risk, many unforeseen side effects exist which do just that.
- "To what operations is the server not responding?" - Through the console, looking at statistics across time can indicate both health and operations that should be occurring but are not.
- If "show trans", issued several times across a 20-30 minute period (or less) during normal times of server activity, shows no increase in the count (that is, the first column, "COUNT") for transaction types such as OPEN_DB, DB_MODIFIED_TIME and OPEN_COLLECTION, the problem can be considered to be server-wide.
- If the console has messages like this:
LkMgr BEGIN Long Held Lock Dump
Lock(Mode=SIX* LockID(DB DB=/maildisk/mail1/bigshared.nsf)) Waiters countNonIntentLocks = 2 countIntentLocks = 0, queuLength = 4
Req(Status=Granted Mode=SIX Class=Manual Nest=0 Cnt=2
Tran=0 Func=N/A [1077448:00022-10281])
Req(Status=Waiting Mode=S Class=Manual Nest=0 Cnt=1
Tran=0 Func=N/A [1077448:00029-12080] Delay=2min)
Req(Status=Waiting Mode=S Class=Manual Nest=0 Cnt=1
Tran=0 Func=N/A [1077448:00027-11566] Delay=3min)
Req(Status=Waiting Mode=S Class=Manual Nest=0 Cnt=1
Tran=0 Func=N/A [1077448:00021-10024] Delay=5min)
Req(Status=Waiting Mode=S Class=Manual Nest=0 Cnt=1
Tran=0 Func=N/A [1077448:00031-12594] Delay=7min)
or if DEBUG_SHOW_TIMEOUT=1 and DEBUG_CAPTURE_TIMEOUT=1 have been enabled and there is a considerable, ongoing stream of messages, the object being contended for may be the only object involved in the performance issue.
- "Are all users affected by this problem?" - Again, console output can indicate the connection and successful processing of users, so if others have complained, they can perhaps be categorized by their activity or by the Domino objects they are accessing. If "show user debug" is issued, the "Minutes Since Last Used" column can serve to indicate activity of specific users, particularly if they have lodged complaints.
- "Are there predominant, recurring error messages on the console?" - The unavailability of a central resource like names.nsf or other kinds of redundant failure such as out-of-memory conditions can be key to understanding what conditions existed that led up to an issue. These conditions can often be the root cause.
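When long held lock messages like the console sample above appear in volume, it can help to condense them into the contended database, the holder of the lock, and how long each waiter has been delayed. The following is a minimal sketch that handles only the layout shown in that sample: a "LockID(DB DB=...)" line followed by "Req(...)" entries, with delays expressed in whole minutes. The function name summarize_lock_dump is hypothetical, the meaning of the bracketed identifier is not interpreted, and other releases or lock types may format these messages differently.

```python
import re

# Patterns derived only from the console sample in this article.
LOCK_RE = re.compile(r"LockID\(DB DB=([^)]+)\)")
REQ_RE = re.compile(
    r"Req\(Status=(\w+)[^)]*?\[([^\]]+)\](?:\s+Delay=(\d+)min)?\)"
)
TOKEN_RE = re.compile(LOCK_RE.pattern + "|" + REQ_RE.pattern)

SAMPLE = """LkMgr BEGIN Long Held Lock Dump
Lock(Mode=SIX* LockID(DB DB=/maildisk/mail1/bigshared.nsf)) Waiters countNonIntentLocks = 2 countIntentLocks = 0, queuLength = 4
Req(Status=Granted Mode=SIX Class=Manual Nest=0 Cnt=2
Tran=0 Func=N/A [1077448:00022-10281])
Req(Status=Waiting Mode=S Class=Manual Nest=0 Cnt=1
Tran=0 Func=N/A [1077448:00029-12080] Delay=2min)
Req(Status=Waiting Mode=S Class=Manual Nest=0 Cnt=1
Tran=0 Func=N/A [1077448:00027-11566] Delay=3min)
Req(Status=Waiting Mode=S Class=Manual Nest=0 Cnt=1
Tran=0 Func=N/A [1077448:00021-10024] Delay=5min)
Req(Status=Waiting Mode=S Class=Manual Nest=0 Cnt=1
Tran=0 Func=N/A [1077448:00031-12594] Delay=7min)
"""


def summarize_lock_dump(text):
    """Return {db_path: {"holders": [ids], "waiters": [(id, delay_minutes)]}}."""
    summary = {}
    current = None
    for match in TOKEN_RE.finditer(text):
        db, status, owner, delay = match.groups()
        if db is not None:
            # A new Lock(...) line: later Req entries belong to this database.
            current = summary.setdefault(db, {"holders": [], "waiters": []})
        elif current is not None:
            if status == "Granted":
                current["holders"].append(owner)
            else:
                current["waiters"].append((owner, int(delay) if delay else 0))
    return summary


if __name__ == "__main__":
    for db, info in summarize_lock_dump(SAMPLE).items():
        longest = max((d for _, d in info["waiters"]), default=0)
        print(f"{db}: holders={info['holders']} "
              f"waiters={len(info['waiters'])} longest_wait_min={longest}")
```

If every dump collected during an incident names the same database, that supports the reading that a single contended object, rather than the whole server, is behind the delay.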
Coming soon - Best Practice Techniques for Data Gathering.